Abstract

In machine learning there is considerable interest in techniques which improve planning
ability. Initial investigations have identified a wide variety of techniques to address this
issue. Progress has been hampered by the utility problem, a basic tradeoff between the benefit
of learned knowledge and the cost to locate and apply relevant knowledge. In this paper we
describe the COMPOSER system, which embodies a probabilistic solution to the utility
problem. It is implemented in the PRODIGY architecture. We compare COMPOSER to
four other approaches which appear in the literature.



This research is supported by the National Science Foundation, grant NSF IRI 87-19766.



1 INTRODUCTION 

Increasingly, machine learning is entertained as a mechanism for improving the efficiency
of planning systems. Investigation in this area has identified a wide array of techniques
including macro-operators [DeJong86, Fikes72, Mitchell86], chunks [Laird86], and control
rules [Minton88, Mitchell83]. With these techniques comes a growing battery of successful
demonstrations in domains ranging from the 8-puzzle to Space Shuttle payload processing.
Unfortunately, in what is now called the utility problem, learned knowledge can hurt
performance [Minton88]. This is underscored by a growing body of demonstrations where learning
degrades planning performance [Etzioni90b, Gratch91a, Minton85, Subramanian90].

In an earlier paper we elaborated the limitations of a particular learning approach,
PRODIGY/EBL [Gratch91b]. That paper also sketched the COMPOSER system, which is one
solution to these limitations. COMPOSER is intended as a general solution to the utility problem
which provides probabilistic guarantees of improvement via learning. In this paper we detail
our approach and report on an extensive series of empirical evaluations. These tests compare
COMPOSER's learning criterion against the approaches adopted by PRODIGY/EBL, STATIC
[Etzioni90b], DYNAMIC [Etzioni90a], and PALO [Greiner92]. These results substantiate
our earlier analyses. They also cast doubt on the efficacy of nonrecursive control knowledge.
This is significant since the issue of nonrecursive control knowledge has received
considerable attention in recent literature [Etzioni90b, Letovsky90, Subramanian90].

2 LEARNING AS SEARCH 

Learning can be viewed as a transformational process in which the learning system applies
a series of transformations to the original problem solver (see [Gratch90a, Greiner92]). The
intended result is more effective planning behavior. Typically a planner is transformed with
control knowledge. Different forms of control knowledge include macro-operators
[Braverman88, Laird86, Markovitch89], control rules [Drummond89, Etzioni90b, Minton88,
Mitchell83], and static board evaluation functions [Utgoff91].

The transformations available to a learner define its vocabulary of transformations. These
are essentially learning operators and collectively they define a transformation space. For
instance, acquiring a macro-operator can be viewed as transforming the initial system (the
original planner) into a new system (the planner operating with the macro-operator). A
learning system must explore this space for a sequence of transformations which results in
a better planner.

To evaluate different learning approaches we must clarify our intuitive notions of when one
planner is more efficient than another. For this paper, we characterize planners through a
numeric utility function which ranks the behavior of a planner over a fixed distribution of
problems. In particular, we equate efficiency with minimizing planning time. Other
measures are possible and our approach could apply to them as well. For any given problem,
utility increases as the time to solve the problem decreases. The utility of a planner is defined
with respect to a particular problem distribution as the sum of problem utilities weighted by
the probability of occurrence of each problem:

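$$\mathrm{UTILITY}(\mathit{planner}) = -\sum_{\mathit{prob} \in \mathit{Distribution}} \mathrm{TimeCost}(\mathit{planner}, \mathit{prob}) \times \Pr(\mathit{prob})$$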

Note that higher utility does not entail that the planning time of any particular problem is 
reduced. Rather, the expected cost to solve any representative sample of problems is less. 

Utility is a preference function which ranks different planners. It is also useful to discuss
the utility of individual transformations. The incremental utility of a transformation is
defined as the change in utility that results from applying the transformation to a particular
planner (e.g. adopting a control rule). This means the incremental utility of a transformation
is conditional on the planner to which it is applied. We denote this as
ΔUTILITY(Transformation|Planner). Applying a transformation with positive incremental utility
results in a more effective planner. A learning system need not explicitly compute utility values
to identify preferred planners, but it must act (at least approximately) as if it does. In fact
many learning systems do not explicitly evaluate utility.
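
The two definitions above can be illustrated with a minimal sketch. It assumes a finite problem distribution given as a dictionary of probabilities and models a planner as a function from a problem name to its solution time; the names `utility`, `incremental_utility`, and the toy timings are hypothetical and are not part of COMPOSER.

```python
from typing import Callable, Dict

# Hypothetical model: a "planner" maps a problem name to the time (seconds)
# it needs to solve that problem.
Planner = Callable[[str], float]


def utility(planner: Planner, distribution: Dict[str, float]) -> float:
    """Negated expected solution time over a discrete problem distribution."""
    return -sum(planner(prob) * pr for prob, pr in distribution.items())


def incremental_utility(transformed: Planner, original: Planner,
                        distribution: Dict[str, float]) -> float:
    """Change in utility caused by transforming the original planner."""
    return utility(transformed, distribution) - utility(original, distribution)


# Toy numbers for illustration only (not taken from the paper).
distribution = {"p1": 0.5, "p2": 0.5}
base_times = {"p1": 10.0, "p2": 4.0}
# The rule speeds up p1 but slows down p2 because of its match cost.
rule_times = {"p1": 6.0, "p2": 5.0}

delta = incremental_utility(rule_times.__getitem__, base_times.__getitem__,
                            distribution)
```

In this toy case the incremental utility is positive even though problem p2 takes longer with the rule, which is the situation the note on higher utility describes.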

3 COMPOSER 

COMPOSER uses the previous definition of utility to evaluate and adopt control knowledge
which, with high probability, improves planning performance. Its design was motivated by
deficiencies in PRODIGY/EBL. Another paper illustrates how these deficiencies are shared
by many other speed-up learning techniques [Gratch92]. In this paper we focus on the
implementation of COMPOSER. In [Gratch91b] we note that PRODIGY/EBL adopts two
heuristic simplifications to identify beneficial control rules. First, aspects of the problem
distribution are learned from a single example. Secondly, control rules are treated as if they do
not interact. These simplifications have the unfortunate consequence that PRODIGY/EBL
can learn control strategies which yield planners which are up to an order of magnitude
slower than the original planner. We replace this heuristic approach with a rigorous alternative.

3.1 Algorithm

The current implementation of COMPOSER learns control knowledge in the form of control
rules. Other transformations could be adopted by suitably altering the statistics gathering
procedure described in Section 3.2. COMPOSER is implemented within the PRODIGY 2.0
architecture. This system includes the PRODIGY planner, which is a STRIPS-like system.
It identifies plans through depth-first search. The learning component of PRODIGY/EBL
analyzes solution traces and proposes control rules to correct any observed inefficiencies.
These control rules are condition-action statements which inform the PRODIGY planner to
delete or prefer certain node, operator, or variable binding choices. COMPOSER primarily
utilizes selection and rejection rules. This is discussed further in Section 5.

COMPOSER differs from PRODIGY/EBL in how statistics are gathered and how control rules 
are introduced into the PRODIGY planner. We implement a hill-climbing approach to the 
utility problem. The basic algorithm is sketched in Figure 1. We assume the user has
provided a training set which is drawn according to the distribution of problems.

Input: TRAINING_EXAMPLES

CONTROL_STRATEGY = ∅
CANDIDATE_SET = ∅
While TRAINING_EXAMPLES ≠ ∅
    solve problem with PRODIGY+CONTROL_STRATEGY
    learn new rules and add them to CANDIDATE_SET
    gather statistics for all rules in CANDIDATE_SET
    POSITIVE_RULES = ∅
    Forall rules ∈ CANDIDATE_SET
        If ΔUTILITY(rule|PRODIGY+CONTROL_STRATEGY) significantly negative
            remove rule from CANDIDATE_SET
        If ΔUTILITY(rule|PRODIGY+CONTROL_STRATEGY) significantly positive
            add rule to POSITIVE_RULES
    If POSITIVE_RULES ≠ ∅
        append rule with highest utility to CONTROL_STRATEGY
        remove this rule from CANDIDATE_SET
        discard all statistics on rules in CANDIDATE_SET

Output: CONTROL_STRATEGY
Figure 1: The COMPOSER algorithm



Learning occurs with a single pass through the training examples. The algorithm
incrementally adds control rules to a currently adopted control strategy. A rule is added only if it has
demonstrated its benefit to a pre-specified confidence level. Once added, the rule changes
how the planner behaves on subsequent training examples. New rules are proposed, and
statistics gathered, with respect to the current control strategy. In this manner a control strategy
is "grown" one rule at a time until the training set is exhausted.
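
The hill-climbing loop can be sketched as follows. This is only an illustration under stated assumptions: the planner and learning module are hidden behind a hypothetical `solve` callable, and the statistical test behind a hypothetical `decide` callable; neither corresponds to PRODIGY's actual interface.

```python
from statistics import fmean
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

Rule = Hashable
# Hypothetical interfaces (not PRODIGY's real API):
# solve(problem, strategy, candidates) -> (newly proposed rules,
#                                          {rule: incremental utility on this problem})
Solve = Callable[[object, List[Rule], List[Rule]],
                 Tuple[List[Rule], Dict[Rule, float]]]
# decide(observations) -> "adopt", "discard", or None while still undecided
Decide = Callable[[List[float]], Optional[str]]


def composer_loop(problems: Iterable[object], solve: Solve,
                  decide: Decide) -> List[Rule]:
    strategy: List[Rule] = []
    stats: Dict[Rule, List[float]] = {}      # candidate rule -> observations
    for problem in problems:                 # single pass over the training set
        new_rules, observed = solve(problem, strategy, list(stats))
        for rule in new_rules:
            if rule not in strategy:
                stats.setdefault(rule, [])
        for rule, value in observed.items():
            if rule in stats:
                stats[rule].append(value)
        positive: List[Rule] = []
        for rule in list(stats):
            verdict = decide(stats[rule])
            if verdict == "discard":
                del stats[rule]              # other statistics stay valid
            elif verdict == "adopt":
                positive.append(rule)
        if positive:
            # decide() is assumed to return "adopt" only for non-empty samples.
            best = max(positive, key=lambda r: fmean(stats[r]))
            strategy.append(best)
            del stats[best]
            # Remaining statistics were conditional on the old strategy.
            stats = {rule: [] for rule in stats}
    return strategy
```

Only one rule is adopted per iteration, so every adopted rule is evaluated against the strategy that actually precedes it.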



3.2 Gathering Incremental Utility Statistics 

Gathering incremental utility statistics is the one aspect of COMPOSER which ties it to a
particular representation for control knowledge, namely control rules. Other transformations
would require analogous data gathering procedures.

A control rule should only be adopted if it improves the efficiency of the problem solver on
average. This average can be estimated by determining how the rule performs on individual
problems and combining information from several problems. The next section discusses
how to combine information. But first we will describe how COMPOSER extracts
incremental utility values on individual problems.

How can we determine the incremental utility of a control rule on a particular problem? The
obvious approach is to solve the problem twice: once using the current control strategy
without the rule in question, and once using the strategy augmented with the candidate rule.
The difference in problem solving cost between these two runs is the incremental utility of
the control rule on that problem. This process must be repeated for every rule in the candidate
set. Clearly this approach is too expensive in practice.

COMPOSER implements a more efficient approach for gathering incremental utility values.
It can extract a utility value for each candidate rule simultaneously from a single solution
trace. While PRODIGY/EBL also derives multiple estimates from a single example, its
technique is rendered inaccurate by the interactions which occur among rules (see [Gratch91b]).
COMPOSER solves the interaction problem by extracting estimates without allowing the
candidate rules to change the search behavior of the planner. Control rules only affect search
behavior if they are adopted into the control strategy.

In contrast to adopted rules, the actions of candidate rules are not acted upon. They are
simply noted in the problem solving trace. After a problem is solved, COMPOSER analyzes the
annotated trace, and identifies the search paths which would have been avoided by each rule.
The time spent exploring these avoidable paths indicates the savings which would be
provided by the rule. This savings is compared with the recorded precondition match cost, and
the difference is reported as the incremental utility of the rule for that problem. More details
may be found in [Gratch90b].
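
A rough sketch of this trace analysis is given below. The trace format is an assumption made for illustration: each `TraceNode` stands for one explored path, the paths are taken to be disjoint (so their times can be added), and each path is annotated with the candidate rules that would have rejected it. The actual procedure is described in [Gratch90b].

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TraceNode:
    """One explored search path in a solved problem's trace (hypothetical format)."""
    time_spent: float                                    # seconds spent on this path
    rejected_by: Set[str] = field(default_factory=set)   # candidates that would prune it


def incremental_utilities(trace: List[TraceNode],
                          match_cost: Dict[str, float]) -> Dict[str, float]:
    """Savings from avoidable paths minus recorded match cost, per candidate rule."""
    savings = {rule: 0.0 for rule in match_cost}
    for node in trace:
        for rule in node.rejected_by:
            if rule in savings:
                savings[rule] += node.time_spent
    return {rule: savings[rule] - match_cost[rule] for rule in match_cost}


# Toy trace: r1 would have avoided two paths, r2 only one short path.
trace = [TraceNode(2.0, {"r1"}), TraceNode(0.5, {"r1", "r2"}), TraceNode(3.0)]
values = incremental_utilities(trace, {"r1": 0.4, "r2": 1.0})
```

Here `r1` receives a positive value and `r2` a negative one, because the match cost of `r2` outweighs the single short path it would have pruned.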

It should be noted that this procedure is somewhat more expensive than the heuristic
approach adopted by PRODIGY/EBL. This is because COMPOSER pays the penalty of
matching preconditions without acquiring any of the benefit of candidate control rules. We are not
aware of a reliable technique which avoids this additional cost.

3.3 Commitment Criterion 

The incremental utility of a transformation across the problem distribution can be estimated
by averaging utility values from several problems. COMPOSER uses average incremental
utility to decide a control rule's fate. The system must apply only transformations which
have positive incremental utility. Additionally, as the data gathering process is expensive,
the system must discard candidate transformations which have negative incremental utility.
Both of these choices should be determined accurately but with as few examples as possible.
In the field of statistics this is referred to as a sequential analysis problem. Observations are
gathered until some stopping criterion is satisfied. As this criterion will commit COMPOSER
to adopting or discarding a transformation, we refer to it as a commitment criterion. In this
case we are estimating the incremental utility of transformations to some specified
confidence. We require the user to provide an error parameter, δ, which specifies the acceptable
probability of applying or discarding a transformation incorrectly.¹

Formally, COMPOSER must choose among two hypotheses for each candidate: 

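$$H_0:\ \Delta\mathrm{UTILITY}(\mathit{rule} \mid \mathit{planner}+\mathit{control\_strategy}) \le 0$$
or
$$H_1:\ \Delta\mathrm{UTILITY}(\mathit{rule} \mid \mathit{planner}+\mathit{control\_strategy}) > 0$$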

The average incremental utility is only an estimate of the true incremental utility. This
estimate will differ from the true value, so the system must bound the discrepancy. In particular,
if the rule is negative, the system must bound the probability that it will appear positive, and
vice versa. This is equivalent to bounding the probability that the difference between the true
utility and the estimate is larger than the magnitude of the true utility:

$$\Pr\left(\,|\mathrm{ESTIMATE} - \mathrm{UTILITY}| > |\mathrm{UTILITY}|\,\right) = \delta$$

Nádas [Nádas69] describes a distribution-free commitment criterion which applies. After
taking M examples the probability of error is (approximately) δ, where M is defined as:

$$M = \min\left\{ n > 1 : \frac{V_{r,n}}{\bar{X}_{r,n}^{2}} < \frac{n}{a^{2}} \right\}$$

where $\bar{X}_{r,n}$ is the average utility of the rule r over n problems, $X_{r,i}$ is the utility of r on the
ith problem, and $V_{r,n} = \frac{1}{n}\sum_{i=1}^{n} (X_{r,i} - \bar{X}_{r,n})^{2}$ is an indicator of the variance in the sample.
The parameter a satisfies the constraint that $\Phi(a) = \delta/2$, where $\Phi$ is the cumulative
distribution function of the standard normal distribution. In simpler words, COMPOSER takes
examples until the inequality $V_{r,n}/\bar{X}_{r,n}^{2} < n/a^{2}$ is satisfied.
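
The stopping test can be sketched in a few lines. The sketch assumes the inequality reads V/X̄² < n/a², with V the sample variance computed with a 1/n factor and a chosen so that Φ(a) = δ/2; the function name `nadas_commitment` and the returned labels are hypothetical.

```python
from statistics import NormalDist, fmean, pvariance
from typing import Optional, Sequence


def nadas_commitment(values: Sequence[float], delta: float) -> Optional[str]:
    """Return "adopt", "discard", or None if more examples are needed.

    values: incremental utility of one rule on each problem seen so far.
    delta:  acceptable (approximate) probability of a wrong commitment.
    """
    n = len(values)
    if n < 2:
        return None                       # too few observations to judge
    a = NormalDist().inv_cdf(delta / 2)   # Phi(a) = delta / 2
    avg = fmean(values)
    if avg == 0:
        return None                       # ratio undefined, keep sampling
    var = pvariance(values)               # (1/n) * sum((x - avg)^2)
    if var / avg ** 2 < n / a ** 2:
        return "adopt" if avg > 0 else "discard"
    return None
```

For example, with δ = 0.05 the observations `[1.0, 1.2, 0.8, 1.1, 0.9]` already satisfy the inequality, while a noisy sample such as `[1.0, -1.0, 1.0, -1.0, 0.5]` does not, so the rule stays in the candidate set.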

The technique approximates the user-specified level. If the user specifies an error level of
δ, the true error level will be close to but not necessarily equal to δ; it may be slightly more
or less. The discrepancy is a function of the underlying distribution. Woodroofe provides
a further analysis which indicates the approximation is very close in practice
[Woodroofe82].



1. Alternatively we could require that δ represent the cumulative error across all applied transformations.
This requires determining a δi at each step such that the sum of all δi's equals δ.



In summary, the commitment criterion permits COMPOSER to identify when
transformations are beneficial with some pre-specified probability. After each problem solving
attempt, COMPOSER updates the statistics and evaluates the commitment criterion for each
control rule in the candidate set. If no control rule has attained the confidence requirement,
another problem is solved. If the commitment criterion identifies control rules with positive
incremental utility (there may be more than one), COMPOSER adds the control rule with
highest positive incremental utility to the current strategy, and removes it from the candidate
set. Statistics for the remaining candidates are discarded as they are conditional on the
previous control strategy, and meaningless in the context of the new strategy. If the commitment
criterion identifies candidate rules with negative incremental utility, they are eliminated
from the candidate set. Eliminating a candidate does not affect the current strategy, so the
statistics associated with the remaining candidate control rules are not discarded. This cycle
is repeated until the training set is exhausted. Each time a transformation is adopted the
efficiency of the PRODIGY planner is increased, giving COMPOSER an anytime behavior
[Dean88].

4 EVALUATION 

We evaluated COMPOSER's commitment criterion against several other commitment
criteria. Before discussing the experiments we will review these other criteria.

4.1 PRODIGY/EBL's Utility Analysis

PRODIGY/EBL adopts transformations with a heuristic utility analysis. As control rules are
proposed they are added to the current control strategy. The savings afforded by each rule
is estimated from a single example and this value is credited to the rule each time it applies.
Match cost is measured directly. If the cumulative cost exceeds the cumulative savings, the
rule is removed from the current control strategy. The issue of interactions among
transformations is not addressed.
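
The bookkeeping behind this heuristic can be pictured with a small ledger per rule. This is a simplified sketch of the description above, not PRODIGY/EBL's code; the class name `RuleLedger` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RuleLedger:
    """Running totals kept for one control rule (illustrative only)."""
    estimated_saving: float          # estimated once, from a single example
    cumulative_savings: float = 0.0
    cumulative_cost: float = 0.0

    def record(self, match_cost: float, applied: bool) -> None:
        # Match cost is measured directly on every attempt to match the rule.
        self.cumulative_cost += match_cost
        # The fixed estimate is credited each time the rule applies.
        if applied:
            self.cumulative_savings += self.estimated_saving

    def keep(self) -> bool:
        # The rule is removed once cumulative cost exceeds cumulative savings.
        return self.cumulative_cost <= self.cumulative_savings
```

Because the credited saving is a single-example estimate, a rule whose first example was unrepresentative can be kept or removed for the wrong reason, which is the first of the two simplifications criticized in Section 3.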

4.2 STATIC's Nonrecursive Hypothesis

STATIC's commitment criterion is based on Etzioni's structural theory of utility. The
criterion is grounded in the nonrecursive hypothesis, which states that transformations will have
positive incremental utility, regardless of problem distribution, if they are generated from
nonrecursive explanations of planning behavior. An explanation is defined as recursive if
a predicate in a subgoal is derived using another instantiation of the same predicate. The issue
of interactions between transformations is not addressed. STATIC applies this criterion to
control rules but the idea has been applied to macro-operators as well [Letovsky90,
Subramanian90].

STATIC has exceeded PRODIGY/EBL's performance on a number of domains, including one
domain for which PRODIGY/EBL degrades planning performance. The nonrecursive
hypothesis is cited as the principal reason for this success [Etzioni90b]. This claim is difficult
to evaluate as these systems can generate very different control rules. We clarify this issue
by testing the nonrecursive hypothesis on PRODIGY/EBL's learning component.

We constructed the STATIC-RI system as a re-implementation of STATIC's nonrecursive
hypothesis within the PRODIGY/EBL framework. STATIC-RI replaces the commitment
criterion of PRODIGY/EBL with the nonrecursive hypothesis. Instead of utility analysis, as rules
are proposed, STATIC-RI adopts each rule which is based on a nonrecursive explanation and
discards each rule which is based on a recursive explanation. In all other respects STATIC-RI
is identical to PRODIGY/EBL.
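
A structural filter of this kind might look as follows. The sketch assumes an explanation is available as a tree whose nodes carry a predicate name and the subgoal explanations supporting it; the `Explanation` class, `is_recursive`, and `static_ri_filter` are hypothetical names, and the test is a simplification of the definition in Section 4.2.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class Explanation:
    """A node in an explanation tree (hypothetical representation)."""
    predicate: str
    args: Tuple[str, ...] = ()
    support: List["Explanation"] = field(default_factory=list)


def is_recursive(node: Explanation,
                 ancestors: FrozenSet[str] = frozenset()) -> bool:
    """True if some predicate is derived via another instance of itself."""
    if node.predicate in ancestors:
        return True
    below = ancestors | {node.predicate}
    return any(is_recursive(child, below) for child in node.support)


def static_ri_filter(rules: List[Tuple[str, Explanation]]) -> List[str]:
    """Adopt every rule with a nonrecursive explanation, discard the rest."""
    return [name for name, expl in rules if not is_recursive(expl)]
```

No statistics are gathered at all: the decision depends only on the shape of the explanation, which is why the criterion is independent of the problem distribution.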

4.3 DYNAMIC: A Composite System

Etzioni has suggested that the strengths of STATIC and PRODIGY/EBL could be combined
into a single system [Etzioni90a]. The proposed DYNAMIC system incorporates a
two-layered utility criterion. The nonrecursive hypothesis acts as an initial filter, but the remaining
nonrecursive control rules are subject to utility analysis and may be later discarded.

We implemented the DYNAMIC-RI system to test this learning criterion. As control rules 
are proposed by PRODIGY/EBL's learning module, they are first filtered on the basis of the 
nonrecursive hypothesis. The remaining rules undergo utility analysis as in PRODIGY/EBL.

4.4 PALO's Chernoff Bounds

Greiner and Cohen have recently proposed an approach which is similar to COMPOSER.
The Probably Approximately Locally Optimal (PALO) approach also adopts a hill-climbing
technique and evaluates transformations by a statistical method. The primary difference
between our two methods is the commitment criteria. COMPOSER uses the sequential analysis
technique of Nádas, while PALO uses Chernoff bounds.

Both techniques require similar assumptions, namely that the problem distribution is
fixed, training examples are randomly drawn from this distribution, and that the distribution
of utility values over this problem distribution possesses a finite variance. The difference
is that Chernoff bounds provide somewhat stronger guarantees at the cost of more examples.
The Nádas technique implements approximate significance levels: the true error level will
be close to but not necessarily equal to δ. PALO's technique provides worst case bounds.
This means that if the user specifies an error level of δ, the true error level will never exceed
δ, and may in fact be much lower.

The PALO-RI system evaluates this approach. PALO-RI is a modification of COMPOSER
where the Nádas technique is replaced with PALO's commitment criterion. Examples are
gathered until a control rule satisfies the following inequality:




n 
where $X_{r,i}$ is the utility of rule r on problem i, and A is a parameter which is related to the
maximal cost of solving problems.² One disadvantage of this technique is that it is difficult to assess
the optimal value for A, and the chosen value can significantly impact the number of
examples required by the method. Our resolution of this issue is discussed in the experiments.
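
The sensitivity to the range parameter can be seen from a generic Hoeffding-style bound. The sketch below is not PALO's actual commitment criterion; it only assumes that per-problem utilities lie in an interval of width `value_range` and asks how many samples are needed before the sample mean is within `epsilon` of the true mean with probability at least 1 - δ (one-sided). All names are hypothetical.

```python
import math


def hoeffding_samples(value_range: float, epsilon: float, delta: float) -> int:
    """Smallest n with exp(-2 * n * epsilon**2 / value_range**2) <= delta."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))


# Doubling the assumed range roughly quadruples the number of examples.
n_small = hoeffding_samples(value_range=75.0, epsilon=1.0, delta=0.05)
n_large = hoeffding_samples(value_range=150.0, epsilon=1.0, delta=0.05)
```

The quadratic dependence on the range suggests why an overly cautious setting of the parameter makes a worst-case criterion expensive in training examples.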
4.4 Experiments 

COMPOSER was tested on the STRIPS domain from [Minton88], the AB-WORLD domain
from [Etzioni90a] for which PRODIGY/EBL produced harmful strategies, and the
BIN-WORLD domain from [Gratch91a] which yielded detrimental results for both STATIC and
PRODIGY/EBL's learning criteria. The results are summarized in Figure 2. In each domain
the systems are trained on 100 training examples drawn randomly from a fixed distribution.
A learning curve is generated by saving the current control strategy after every twenty
training examples.³ The graphs illustrate learning curves where the independent measure is the
number of training examples and the dependent measure is execution time for 100 test
problems drawn from the same distribution. This process is repeated five times, using different
but identically distributed training and test sets. Values in Figure 2 represent the average of
these five trials. The header "# Rules" indicates the average number of rules learned by the
system; "Train Time" is the number of seconds required to process the 100 training
examples; "Test Time" refers to the number of seconds required to generate solutions for the 100
test problems.

As we stated earlier, COMPOSER does not implement a general approach to evaluating
preference rules. In particular, it cannot properly evaluate the incremental utility of preference
rules in the AB-WORLD and STRIPS domains. To ensure that differences reflect the
commitment criteria and not the vocabulary of transformations, we disabled the learning of
preference rules for every system in the STRIPS and AB-WORLD domains. We evaluated the
ramifications of this change by comparing PRODIGY/EBL with and without preference rules
and found that, in both domains, more efficient strategies resulted when preference rules
were disabled. This is consistent with statements made by Minton concerning preference
rules [Minton88 p. 129]. COMPOSER and PALO-RI require a parameter which represents
the confidence level for adding a transformation. This was set at 95%. For PALO-RI's A
parameter, we tried to assign a value which is close to the maximal problem solving cost
without going under. It was set as follows: AB-WORLD, 75 seconds; STRIPS, 150 seconds;
BIN-WORLD, 150 seconds.

2. More precisely, this is the maximal cost of the problem solver with the current control strategy plus the 
maximum problem solving cost of the problem solver using the control rule which we are analyzing. 

3. PRODIGY/EBL's utility analysis requires an additional settling phase after training. Each control strategy
produced by PRODIGY/EBL and DYNAMIC-RI received a settling phase of 20 problems where learning was
disabled but utility analysis continued to filter rules.

Figure 2: Summary of empirical results



It was quickly apparent that PALO-RI would not adopt any transformations within the 100
training examples. We tried to give the system enough examples to reach quiescence but this
proved too expensive. The problem is twofold: first, too many training examples were
required; secondly, and as a consequence of the first problem, the candidate set grew large
since harmful rules were not discarded as quickly as in COMPOSER. This increased the cost
to solve each training example. To collect statistics on PALO-RI we only performed one
instead of five learning trials. Furthermore, we terminated PALO-RI after the first
transformation was adopted or 10,000 examples, whichever came first.

4.5 Discussion

The results illustrate several interesting features. COMPOSER exceeded the performance of
all other approaches in every domain. In AB-WORLD and STRIPS, COMPOSER identified
beneficial control strategies. In BIN-WORLD the system did not adopt any transformations.
It does not appear that any control rule improves performance in this domain. It should be
stressed that all systems utilized the same learning module. Therefore the results represent
differences in commitment strategies rather than in the vocabulary of transformations.

As expected, COMPOSER and PALO-RI had the highest learning times as they incur the
precondition cost of candidate control rules without gaining the benefit of their
recommendations. The one exception was BIN-WORLD where COMPOSER quickly discarded a very
expensive control rule which PRODIGY/EBL, STATIC-RI, and DYNAMIC-RI retained. An
encouraging result is that COMPOSER's learning times were not substantially higher than
those of the non-statistical systems. PALO-RI's learning times were significantly higher.

The results cast doubts on the nonrecursive hypothesis. STATIC-RI yielded the worst
performance on all domains. Even in conjunction with utility analysis the results are mixed:
benefit on AB-WORLD, slightly worse than utility analysis alone in STRIPS, and worse
than no learning in BIN-WORLD. A post-hoc analysis of control strategies did indicate that
the best rules were nonrecursive, but many nonrecursive rules were also detrimental. The
slow-down on BIN-WORLD primarily results from one nonrecursive control rule. Thus it
appears that nonrecursiveness may be an important property but is insufficient to ensure
performance improvements. These results are interesting since Etzioni reports that STATIC
outperforms PRODIGY/EBL and No Learning in AB-WORLD. The nonrecursive hypothesis
cannot account for this difference. We attribute the difference to the fact that STATIC and
STATIC-RI entertain different sets of control rules. STATIC-RI was constrained to use the
vocabulary which was available to PRODIGY/EBL while STATIC has its own rule generator.

Finally, although PALO-RI did not improve performance within the 100 training examples,
we believe that if it were given sufficient examples it would outperform all other systems.
With extended examples it did exceed COMPOSER's performance in AB-WORLD. This is
because the PALO approach commits to transformations with highest incremental utility
while COMPOSER balances incremental utility against variance. Unfortunately the cost of
PALO's performance improvement is very high, both in terms of examples and learning time.
Thus, while COMPOSER may identify somewhat less beneficial strategies, it achieves much
faster convergence.

5 LIMITATIONS AND FUTURE RESEARCH 

Our investigations have exposed two important issues for future research. 

5.1 Preference Rules

There are difficulties in extending COMPOSER's utility gathering approach to preference
rules. It is easy to record the match cost for these rules. The problem stems from determining
how much a rule would save if it were added to the control strategy. This is straightforward
in the case of rules which delete alternatives. The search space explored by the planner using
such a rule will always be a subset of the search space explored without the rule. This
is not necessarily the case with preference rules. A candidate preference rule might suggest
a search path which was not explored when the training example was solved (this can arise
if there are multiple solutions to the training example). To determine the potential savings
of the preference rule under these circumstances, the system must re-invoke the planner and
explore this alternative path. This need may arise many times in one problem and other
candidate preference rules might require even different paths to be expanded.

This discussion points to a general issue: some transformation vocabularies may be easier
to implement within the COMPOSER framework than others. Perhaps the issue can be
resolved by identifying alternative means to gather utility values. This problem disappears if
we are willing to solve training problems twice (once with and once without the
transformation) but this is unlikely to be feasible in practice.

5.2 Commitment Criteria 

Both Greiner and Cohen's approach and our own provide probabilistic guarantees of
improvement through learning. The commitment criteria used by these systems exhibit
different behaviors. Chernoff bounds produce better control strategies but at a higher learning
cost. Neither technique directly assesses the tradeoff between the improvement due to
learning and the cost to achieve that improvement. Currently we are investigating ways to apply
decision theoretic methods to resolve this tradeoff in a principled way.

6 CONCLUSIONS

Learning shows great promise to extend the generality and effectiveness of planning
techniques. Unfortunately, many learning approaches are based on poorly understood heuristics.
In many circumstances a technique designed to improve planning performance can have the
opposite effect.

In this paper we discussed one general approach to the utility problem which gives
probabilistic guarantees of improvement through learning. Our implementation is restricted to
control rules but could be extended to other representations of control knowledge. We
contrasted COMPOSER with four other learning techniques: three which do not provide
guarantees, and one which does. The utility analysis method of PRODIGY/EBL, the
nonrecursive hypothesis of STATIC, and even a combination of both can produce substantial
performance degradations. Greiner and Cohen's PALO approach should yield somewhat better
performance improvements than COMPOSER but at a substantially higher learning cost.